Me: Chris Hausler

Today - Pandas and Scikit-Learn

And a lot of firsts

  • first MPUG meeting... Hi
  • first presentation using IPython Notebook

Python Data Analysis Library (pandas)

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
It creates something similar to R's data frames... but better

I think it's great, but I'm still a bit clumsy with it... also the doco is still a little hit and miss

Some imports


In [1]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)

In [2]:
import numpy as np
import pandas as pd
import pylab as plt
import matplotlib 
%matplotlib inline
pd.__version__


Out[2]:
'0.13.1'

pandas has two main data structures: Series and DataFrame

Series - Like a one-dimensional array, but better


In [3]:
values = [5,3,4,8,2,9]
vals = pd.Series(values)
vals


Out[3]:
0    5
1    3
2    4
3    8
4    2
5    9
dtype: int64

Each value is now associated with an index. The index itself is an object of class Index and can be manipulated directly.


In [4]:
vals.index


Out[4]:
Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')

In [5]:
vals.values


Out[5]:
array([5, 3, 4, 8, 2, 9])

In [6]:
vals * 2.5


Out[6]:
0    12.5
1     7.5
2    10.0
3    20.0
4     5.0
5    22.5
dtype: float64

We can give named indexes


In [7]:
vals2 = pd.Series(values, index=['tom','sally','jeff','george','pablo','florence'])
vals2


Out[7]:
tom         5
sally       3
jeff        4
george      8
pablo       2
florence    9
dtype: int64

And use these to get the data we want


In [8]:
vals2[['florence','tom']]


Out[8]:
florence    9
tom         5
dtype: int64

In [9]:
vals2[['florence','tom','kate']]


Out[9]:
florence     9
tom          5
kate       NaN
dtype: float64

Dealing with missing values


In [10]:
vals3 = vals2[['tom','sally','pablo','florence','ricky','katrin']]
vals3


Out[10]:
tom          5
sally        3
pablo        2
florence     9
ricky      NaN
katrin     NaN
dtype: float64

Get rid of them


In [11]:
vals3.dropna()


Out[11]:
tom         5
sally       3
pablo       2
florence    9
dtype: float64

Fill them with a value


In [12]:
vals3.fillna(0)


Out[12]:
tom         5
sally       3
pablo       2
florence    9
ricky       0
katrin      0
dtype: float64

Fill them with a calculated value


In [13]:
vals3.fillna(vals3.mean())


Out[13]:
tom         5.00
sally       3.00
pablo       2.00
florence    9.00
ricky       4.75
katrin      4.75
dtype: float64

Use a function like forward fill


In [14]:
vals3.fillna(method='ffill')


Out[14]:
tom         5
sally       3
pablo       2
florence    9
ricky       9
katrin      9
dtype: float64

A handy way to get a picture of our data


In [15]:
vals3.describe()


Out[15]:
count    4.000000
mean     4.750000
std      3.095696
min      2.000000
25%      2.750000
50%      4.000000
75%      6.000000
max      9.000000
dtype: float64

DataFrame - Like a 2D array... with bells and whistles


In [16]:
# give vals named indexes too, and re-select vals3 so the two Series only partly overlap
vals.index = pd.Index(['tom','sally','pablo','florence','ricky','katrin'])
vals3 = vals3[['tom','sally','pablo','florence','billy','katrin']]

In [17]:
# create a dataframe
dat = pd.DataFrame({'orig':vals,'new':vals3})
dat


Out[17]:
          new  orig
billy     NaN   NaN
florence    9     8
katrin    NaN     9
pablo       2     4
ricky     NaN     2
sally       3     3
tom         5     5

7 rows × 2 columns

Check for nulls


In [18]:
dat.isnull()


Out[18]:
            new   orig
billy      True   True
florence  False  False
katrin     True  False
pablo     False  False
ricky      True  False
sally     False  False
tom       False  False

7 rows × 2 columns

Drop rows with nulls


In [19]:
dat.dropna()


Out[19]:
          new  orig
florence    9     8
pablo       2     4
sally       3     3
tom         5     5

4 rows × 2 columns

Time series with pandas DataFrames - a winning combination

Data from Google Trends... what correlates (+ve & -ve) with the search term "hipster"?

Read the hipster correlations from a CSV file

pandas supports many file formats for reading and writing (a few are sketched below), including

  • csv
  • json
  • pickle
  • the clipboard

In [20]:
hipster = pd.read_csv('hipster.csv')
hipster[:10]


Out[20]:
         Date  hipster  modcloth  gumtree perth
0  2004-01-04   -0.976    -0.817         -0.844
1  2004-01-11   -0.816    -0.817         -0.844
2  2004-01-18   -0.837    -0.817         -0.844
3  2004-01-25   -0.976    -0.817         -0.844
4  2004-02-01   -0.722    -0.817         -0.844
5  2004-02-08   -0.795    -0.817         -0.844
6  2004-02-15   -0.723    -0.817         -0.844
7  2004-02-22   -0.713    -0.817         -0.844
8  2004-02-29   -0.786    -0.817         -0.844
9  2004-03-07   -0.675    -0.817         -0.844

10 rows × 4 columns

Set the index to a datetime


In [21]:
hipster = hipster.set_index(pd.DatetimeIndex(hipster.pop('Date')))
hipster[:10]


Out[21]:
            hipster  modcloth  gumtree perth
2004-01-04   -0.976    -0.817         -0.844
2004-01-11   -0.816    -0.817         -0.844
2004-01-18   -0.837    -0.817         -0.844
2004-01-25   -0.976    -0.817         -0.844
2004-02-01   -0.722    -0.817         -0.844
2004-02-08   -0.795    -0.817         -0.844
2004-02-15   -0.723    -0.817         -0.844
2004-02-22   -0.713    -0.817         -0.844
2004-02-29   -0.786    -0.817         -0.844
2004-03-07   -0.675    -0.817         -0.844

10 rows × 3 columns

Now load the anti-Hipster data


In [22]:
not_hipster = pd.read_csv('negative-hipster.csv')
not_hipster = not_hipster.set_index(pd.DatetimeIndex(not_hipster.pop('Date')))

In [23]:
not_hipster[:10]


Out[23]:
            yellow pages  windows installer  techno
2004-01-04         1.341              0.668   0.871
2004-01-11         1.239              1.000   1.122
2004-01-18         1.022              0.768   1.053
2004-01-25         0.923              0.943   0.807
2004-02-01         0.904              0.799   0.612
2004-02-08         0.786              0.613   0.614
2004-02-15         0.729              0.956   0.391
2004-02-22         0.537              0.667   1.124
2004-02-29         0.534              1.415   1.078
2004-03-07         0.229              0.220   1.918

10 rows × 3 columns

Check the values of one column


In [24]:
hipster.hipster.head()


Out[24]:
2004-01-04   -0.976
2004-01-11   -0.816
2004-01-18   -0.837
2004-01-25   -0.976
2004-02-01   -0.722
Name: hipster, dtype: float64

Check another, but get the values as a numpy.ndarray


In [25]:
hipster['gumtree perth'].values[:20]


Out[25]:
array([-0.844, -0.844, -0.844, -0.844, -0.844, -0.844, -0.844, -0.844,
       -0.844, -0.844, -0.844, -0.844, -0.844, -0.844, -0.844, -0.844,
       -0.844, -0.844, -0.844, -0.844])

View the data types; they don't need to be homogeneous


In [26]:
hipster.dtypes


Out[26]:
hipster          float64
modcloth         float64
gumtree perth    float64
dtype: object

Joins on indexes are easy!


In [27]:
trend = hipster.join(not_hipster, how='inner')
trend.head()


Out[27]:
            hipster  modcloth  gumtree perth  yellow pages  windows installer  techno
2004-01-04   -0.976    -0.817         -0.844         1.341              0.668   0.871
2004-01-11   -0.816    -0.817         -0.844         1.239              1.000   1.122
2004-01-18   -0.837    -0.817         -0.844         1.022              0.768   1.053
2004-01-25   -0.976    -0.817         -0.844         0.923              0.943   0.807
2004-02-01   -0.722    -0.817         -0.844         0.904              0.799   0.612

5 rows × 6 columns

We can check the column names and values


In [28]:
trend.columns


Out[28]:
Index([u'hipster', u'modcloth', u'gumtree perth', u'yellow pages', u'windows installer', u'techno'], dtype='object')

In [29]:
trend.values


Out[29]:
array([[-0.976, -0.817, -0.844,  1.341,  0.668,  0.871],
       [-0.816, -0.817, -0.844,  1.239,  1.   ,  1.122],
       [-0.837, -0.817, -0.844,  1.022,  0.768,  1.053],
       ..., 
       [ 1.142,  1.175,  1.394, -1.69 , -1.77 , -1.836],
       [ 1.187,  1.221,  1.403, -1.706, -1.752, -1.796],
       [ 1.514,  1.216,  1.365, -1.72 , -1.701, -1.883]])

Filtering on date ranges is simple


In [30]:
trend['2012-01-01':].head()


Out[30]:
            hipster  modcloth  gumtree perth  yellow pages  windows installer  techno
2012-01-01    1.411     1.192          1.774        -1.077             -1.134  -1.285
2012-01-08    1.513     1.111          1.579        -0.995             -1.183  -1.189
2012-01-15    1.523     1.427          1.613        -1.027             -1.161  -1.337
2012-01-22    1.600     1.490          1.514        -1.140             -1.177  -1.345
2012-01-29    1.459     1.561          1.511        -1.046             -1.224  -1.233

5 rows × 6 columns


In [31]:
trend['2012-01-01': '2013-01-01'].tail(3)


Out[31]:
            hipster  modcloth  gumtree perth  yellow pages  windows installer  techno
2012-12-16    1.645     1.175          1.407        -1.433             -1.515  -1.687
2012-12-23    1.591     1.695          1.625        -1.698             -1.655  -1.504
2012-12-30    1.596     1.515          1.868        -1.515             -1.598  -1.674

3 rows × 6 columns

We can also grab a single date, or a subset of columns


In [32]:
trend.ix['2012-01-01', ['hipster', 'modcloth']]


Out[32]:
hipster     1.411
modcloth    1.192
Name: 2012-01-01 00:00:00, dtype: float64

Or do some boolean filtering


In [33]:
trend[trend.techno < 0].head()


Out[33]:
            hipster  modcloth  gumtree perth  yellow pages  windows installer  techno
2004-04-11   -0.510    -0.817         -0.844         0.521              0.301  -0.270
2006-01-29   -0.838    -0.817         -0.844         1.421              1.309  -0.081
2006-06-25   -0.799    -0.817         -0.833         1.142              1.458  -0.070
2010-01-24   -0.454    -0.183         -0.107        -0.017              0.053  -0.010
2010-01-31   -0.381    -0.276         -0.142         0.187              0.116  -0.044

5 rows × 6 columns

Plotting is built in, and handles dates more easily than raw matplotlib


In [34]:
_ = trend.plot(figsize=(10, 6))
_ = plt.legend(loc='best', ncol=2)


We can also do it for a single column (here, its cumulative sum)


In [35]:
_ = trend.hipster.cumsum().plot()


Or split the columns out to subplots


In [36]:
axs = trend.plot(subplots=True, figsize=(10, 10))


Resampling data is also straightforward.


In [37]:
# resample by month
trend.resample('M', how='mean').head()


Out[37]:
             hipster  modcloth  gumtree perth  yellow pages  windows installer   techno
2004-01-31  -0.90125    -0.817         -0.844       1.13125            0.84475  0.96325
2004-02-29  -0.74780    -0.817         -0.844       0.69800            0.89000  0.76380
2004-03-31  -0.78950    -0.817         -0.844       0.35650            0.73125  1.09175
2004-04-30  -0.70400    -0.817         -0.844       0.48125            0.89125  0.41950
2004-05-31  -0.81820    -0.817         -0.844       0.34780            0.62040  0.72860

5 rows × 6 columns

And here by year; one can also resample by business day, week, month, quarter, year and a bunch of other frequencies (a sketch of the aliases follows the plot)


In [38]:
# resample by year
_ = trend.resample('A', how='mean').plot(figsize=(10, 10))


Other fancy plots include a scatter matrix, with a kernel density estimate (KDE) on the diagonal


In [39]:
# look at the relations
_ = pd.scatter_matrix(trend, figsize=(12,8), diagonal='kde')


Titanic: Machine Learning from Disaster (kaggle.com)

Load the data, explore it and learn from it


In [40]:
df = pd.read_csv('train.csv', header=0)

In [41]:
df.head()


Out[41]:
   PassengerId  Survived  Pclass                                               Name     Sex  Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male   22      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female   38      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female   26      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female   35      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male   35      0      0            373450   8.0500   NaN        S

5 rows × 12 columns

Let's look at the data types here (this time they're heterogeneous)


In [42]:
df.dtypes


Out[42]:
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

We can also get a more verbose summary


In [43]:
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)

DataFrames can be grouped, like in SQL (it sucked to be a young male on the Titanic)


In [44]:
df_grouped = df.groupby(['Pclass', 'Sex'])

In [45]:
df_grouped[['Age', 'Survived']].mean()


Out[45]:
                     Age  Survived
Pclass Sex
1      female  34.611765  0.968085
       male    41.281386  0.368852
2      female  28.722973  0.921053
       male    30.740707  0.157407
3      female  21.750000  0.500000
       male    26.507589  0.135447

6 rows × 2 columns

Histograms are straightforward


In [46]:
ax = df['Age'].dropna().hist(bins=20, range=(0,100), alpha = .5)
ax.set_xlabel('Age')
ax.set_ylabel('Passenger Count')


Out[46]:
<matplotlib.text.Text at 0x7576ed0>

So are boxplots


In [47]:
bp = df.boxplot(column='Age', by='Pclass', grid=False)
for i in set(df.Pclass):
    y = df.Age[df.Pclass==i].dropna()
    # Add some random "jitter" to the x-axis
    x = np.random.normal(i, 0.04, size=len(y))
    plt.plot(x, y, 'r.', alpha=0.2)


If we want to do some learning on this data... let's convert gender to a binary numeric feature


In [48]:
df['isFemale'] = df['Sex'].map( {'female': 1, 'male': 0} ).astype(int)
df[['Sex','isFemale']].head()


Out[48]:
      Sex  isFemale
0    male         0
1  female         1
2  female         1
3  female         1
4    male         0

5 rows × 2 columns

Find non-numeric columns so we can drop them later


In [49]:
drop_cols = df.columns[df.dtypes.map(lambda x: x=='object')]
drop_cols


Out[49]:
Index([u'Name', u'Sex', u'Ticket', u'Cabin', u'Embarked'], dtype='object')

In [50]:
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 13 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
isFemale       891 non-null int64
dtypes: float64(2), int64(6), object(5)

Set up our data to learn from


In [51]:
X = pd.DataFrame(df[[c for c in df.columns if c != 'Survived']])
X = X.drop(drop_cols, axis=1) 
X = X.drop('PassengerId', axis=1)
y = df.Survived
print X.head()


   Pclass  Age  SibSp  Parch     Fare  isFemale
0       3   22      1      0   7.2500         0
1       1   38      1      0  71.2833         1
2       3   26      0      0   7.9250         1
3       1   35      1      0  53.1000         1
4       3   35      0      0   8.0500         0

[5 rows x 6 columns]

Have a quick look at the class distribution


In [52]:
y.groupby(y.values).count()


Out[52]:
0    549
1    342
dtype: int64

And fill in the NaNs for Age


In [53]:
X['Age'] = X.Age.fillna(X.Age.median())

scikit-learn

Machine Learning in Python

  • Simple and efficient tools for data mining and data analysis
  • Accessible to everybody, and reusable in various contexts
  • Built on NumPy, SciPy, and matplotlib
  • Open source, commercially usable - BSD license

What is Machine Learning?

"Machine learning, a branch of artificial intelligence, concerns the construction and study of systems that can learn from data" - thanks, Wikipedia

Prediction with scikit-learn is easy - who will survive?


In [54]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score as acc

In [55]:
# create our classifier
clf = LogisticRegression()
# fit it to the data
clf.fit(X, y)
# and predict
preds = clf.predict(X)
res_acc = acc(y, preds)
print 'Accuracy Score: {:.2f}'.format(res_acc)
print 'Not too bad'


Accuracy Score: 0.80
Not too bad

Cross-validation gives a fairer performance estimate - scoring a model on the same data it was trained on flatters it


In [56]:
from sklearn.cross_validation import KFold

In [57]:
cv = KFold(n=len(y), n_folds=5, shuffle=True)
preds = np.zeros_like(y)
for train, test in cv:
    clf = LogisticRegression()
    clf.fit(X.ix[train], y.ix[train])
    preds[test] = clf.predict(X.ix[test])
res_acc = acc(y, preds)
print 'Accuracy Score: {:.2f}'.format(res_acc)


Accuracy Score: 0.79

And cross-validation can be done more easily


In [58]:
# scikits can actually take care of this for us
from sklearn.cross_validation import cross_val_score

# here
clf = LogisticRegression()
scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
# to here

print scores
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))


[ 0.81564246  0.76966292  0.75842697  0.83707865  0.78651685]
Accuracy: 0.79 (+/- 0.06)

Dealing with categorical data


In [59]:
df.Embarked.head()


Out[59]:
0    S
1    C
2    S
3    S
4    S
Name: Embarked, dtype: object

In [60]:
set(df.Embarked.fillna('O'))


Out[60]:
{'C', 'O', 'Q', 'S'}

Use the LabelEncoder


In [61]:
from sklearn import preprocessing
df.Embarked = df.Embarked.fillna('O')
le = preprocessing.LabelEncoder()
le.fit(df.Embarked.values)
le.classes_


Out[61]:
array(['C', 'O', 'Q', 'S'], dtype=object)

In [62]:
X['Embarked'] = le.transform(df.Embarked.values)
X.Embarked.head()


Out[62]:
0    3
1    0
2    3
3    3
4    3
Name: Embarked, dtype: int64

Tuning classifier parameters


In [63]:
for C in [0.001, 0.01, 0.1, 1, 10, 100]:
    clf = LogisticRegression(C=C, penalty='l1')
    scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
    print("C: {:3.3f}\tAccuracy: {:.2f} (+/- {:.2f})"
          .format(C, scores.mean(), scores.std() * 2))


C: 0.001	Accuracy: 0.67 (+/- 0.03)
C: 0.010	Accuracy: 0.67 (+/- 0.03)
C: 0.100	Accuracy: 0.79 (+/- 0.05)
C: 1.000	Accuracy: 0.80 (+/- 0.06)
C: 10.000	Accuracy: 0.79 (+/- 0.04)
C: 100.000	Accuracy: 0.79 (+/- 0.04)

Comparing classifiers is easy


In [64]:
# normalise the data - scale matters for distance-based models like KNN and SVM
from sklearn.preprocessing import StandardScaler
X = StandardScaler().fit_transform(X)

In [65]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.lda import LDA
from sklearn.qda import QDA

names = ["Nearest Neighbors", "Linear SVM", "RBF SVM", "Decision Tree",
         "Random Forest", "AdaBoost", "Naive Bayes", "LDA",
         "QDA", "Logistic Regression"]
classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    AdaBoostClassifier(),
    GaussianNB(),
    LDA(),
    QDA(),
    LogisticRegression(class_weight='auto')]

In [66]:
# fit each classifier and find the mean performance
res = []
for name, clf in zip(names, classifiers):
    scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
    res.append(scores.mean())

In [67]:
import prettyplotlib as ppl
res = np.array(res)
names = np.array(names)
idx = np.argsort(res)[::-1]
fig, ax = plt.subplots(1, figsize=(14, 6))
ppl.bar(ax, np.arange(len(res)), res[idx], annotate=True,
        xticklabels=names[idx], grid='y')
plt.xticks(rotation=30)
_ = ax.set_ylim(res.min() * 0.95, res.max() * 1.05)


Models can be pickled


In [69]:
# models can be saved
import pickle
s = pickle.dumps(clf)

And there is a whole lot more scikit-learn can do...

  • supervised learning
  • model evaluation
  • unsupervised learning (tiny sketch below)
  • feature selection
  • feature extraction

[machine learning map by Andreas Mueller]